Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: fix racy UnsafeCell access on a closed oneshot #4226

Merged
merged 8 commits into from
Nov 14, 2021

Conversation

hawkw
Copy link
Member

@hawkw hawkw commented Nov 12, 2021

Motivation

Issue #4225 describes a potential race condition in the oneshot
channel in tokio-sync. The race occurs when the oneshot is closed, and
then a thread calls oneshot::Sender::send while another thread is
calling oneshot::Receiver::try_recv or awaiting the reciever. The call
to send puts the sent value in the cell. Then, it sets the
VALUE_SENT (completed) bit, returning a snapshot of the state. If the
state is closed, the sender takes the sent value back out of the cell.
Receiving from the channel also loads the state. The reciever first
checks if the channel is completed, and takes the value out of the cell
if the completed bit is set. This is racy, because if a call to recv
loads the state after the completed bit has been set, it will no longer
treat the channel as closed, and will try to take the value out of the
cell. This results in both threads accessing the cell concurrently.

Solution

This branch adds two loom tests reproducing the race, for try_recv
and for awaiting a Receiver. Both of these tests fail with a causality
violation when accessing the UnsafeCell on master.

I fixed the race condition by changing State::set_completed to not set
the VALUE_SENT bit if the CLOSED bit has already been set. This way,
if the channel is closed before send is called, we won't set the bit
that tells the receiver that it's okay to access the UnsafeCell, and
we the sender can take the value back out without the risk of the
receiver also concurrently accessing the UnsafeCell.

This required changing State::set_completed from a simple fetch-or to
a compare-and-swap loop, which is slightly a bummer, as it's less
efficient than a single fetch-or. However, as far as I can tell,
changing this to a CAS loop is the only solution to the race that
doesn't also change the oneshot's behavior. We currently have tests
asserting that if the channel is closed after a value is sent, the
value is still received. A simpler solution to this problem would be
changing poll_recv and try_recv to check if the channel is closed
first, and only check if the channel is completed and take the value
out if it is not closed. This fixes the race, since the receiver will no
longer try to access the UnsafeCell if the channel has been closed.
But, this would break the existing tests asserting that calling close
after a value has been sent still results in the value being received.
Changing that behavior seems undesirable, so I thought this was worth
the CAS loop.

Fixes: #4225

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
This fixes the race.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
@hawkw hawkw added C-bug Category: This is a bug. M-sync Module: tokio/sync I-crash Problems and improvements related to program crashes/panics. R-loom Run loom tests on this PR labels Nov 12, 2021
@hawkw hawkw requested review from carllerche and Darksonn November 12, 2021 18:36
@hawkw hawkw self-assigned this Nov 12, 2021
@github-actions github-actions bot removed the R-loom Run loom tests on this PR label Nov 12, 2021
Comment on lines +1048 to +1051
loop {
if State(state).is_closed() {
break;
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also considered writing this as

Suggested change
loop {
if State(state).is_closed() {
break;
}
while !State(state).is_closed() {

but i had a vague memory of @carllerche saying that he doesn't like the use of while loops for compare-and-swap loops... :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way it's written here with a loop is perfectly readable, so let's just keep it.

@Darksonn Darksonn added the R-loom Run loom tests on this PR label Nov 12, 2021
Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code. What you are doing here sounds reasonable, but I'd need to read the entire implementation to be confident that it solves the problem.

Comment on lines +1048 to +1051
loop {
if State(state).is_closed() {
break;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the way it's written here with a loop is perfectly readable, so let's just keep it.

@hawkw
Copy link
Member Author

hawkw commented Nov 12, 2021

The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code.

Yeah, I was also thinking about trying to clean up more of the existing implementation (maybe in a follow-up...). But, I'm pretty confident that this change fixes the specific crash in question.

// `poll_recv` or `try_recv` call is occurring concurrently, both
// threads may try to access the `UnsafeCell` if we were to set the
// `VALUE_SENT` bit on a closed channel.
let mut state = cell.load(Ordering::Relaxed);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems safer.

Suggested change
let mut state = cell.load(Ordering::Relaxed);
let mut state = cell.load(Ordering::Acquire);

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it's necessary; the compare_exchange has an Acquire failure ordering, so when we actually go to set the value, we will always perform an Acquire operation, and if the initial Relaxed load is stale, the CAS will fail, and we'll loop again with a value that was accessed with an Acquire ordering. So, making the initial load Relaxed really just serves to avoid an unnecessary Acquire in the case that we're the only thread writing to the state; if it's contended and the Relaxed load is stale, we will always perform an Acquire on subsequent iterations.

Other CAS loops elsewhere in Tokio follow a similar pattern:

// Quick initial debug check to see if the timer is already fired. Since
// firing the timer can only happen with the driver lock held, we know
// we shouldn't be able to "miss" a transition to a fired state, even
// with relaxed ordering.
let mut cur_state = self.state.load(Ordering::Relaxed);
loop {
debug_assert!(cur_state < STATE_MIN_VALUE);
if cur_state > not_after {
break Err(cur_state);
}
match self.state.compare_exchange(
cur_state,
STATE_PENDING_FIRE,
Ordering::AcqRel,
Ordering::Acquire,
) {
Ok(_) => {
break Ok(());
}
Err(actual_state) => {
cur_state = actual_state;
}
}

However, I'm happy to change this if you prefer.

@hawkw hawkw requested a review from Darksonn November 12, 2021 20:32
@hawkw
Copy link
Member Author

hawkw commented Nov 12, 2021

As a note, I can run my repro from #4225 (comment) with this branch for over 40 minutes without seeing any segfaults or mangled strings.

@hawkw
Copy link
Member Author

hawkw commented Nov 13, 2021

The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code.

Yeah, I was also thinking about trying to clean up more of the existing implementation (maybe in a follow-up...). But, I'm pretty confident that this change fixes the specific crash in question.

#4229 should help make some of the oneshot implementation's invariants a bit clearer, I hope!

@Darksonn Darksonn merged commit 4b78ed4 into master Nov 14, 2021
@Darksonn Darksonn deleted the eliza/oneshot-race branch November 14, 2021 17:03
hawkw added a commit that referenced this pull request Nov 15, 2021
Depends on #4226

## Motivation

Currently, the safety invariants and synchronization strategy used in
`tokio::sync::oneshot` are not particularly obvious, especially to a new
reader. It would be nice to better document this code to make these
invariants clearer.

## Solution

This branch adds `SAFETY:` comments to the `oneshot` channel
implementation. In particular, I've focused on documenting the
invariants around when the inner `UnsafeCell` that stores the value can
be accessed by the sender and receiver sides of the channel.

I still want to take a closer look at when the waker cells can be set,
and I'd like to add more documentation there in a follow-up branch.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Nov 15, 2021
# 1.14.0 (November 15, 2021)

### Fixed

- macros: fix compiler errors when using `mut` patterns in `select!`
  ([#4211])
- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])
- sync: make `AtomicWaker` panic safe ([#3689])
- runtime: fix basic scheduler dropping tasks outside a runtime context
  ([#4213])

### Added

- stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223])

### Changed

- io: updated `copy` buffer size to match `std::io::copy` ([#4209])

### Documented

- io: rename buffer to file in doc-test ([#4230])
- sync: fix Notify example ([#4212])

[#4211]: #4211
[#4226]: #4226
[#3689]: #3689
[#4213]: #4213
[#4179]: #4179
[#4223]: #4223
[#4209]: #4209
[#4230]: #4230
[#4212]: #4212
hawkw added a commit that referenced this pull request Nov 15, 2021
Depends on #4226

## Motivation

Currently, the safety invariants and synchronization strategy used in
`tokio::sync::oneshot` are not particularly obvious, especially to a new
reader. It would be nice to better document this code to make these
invariants clearer.

## Solution

This branch adds `SAFETY:` comments to the `oneshot` channel
implementation. In particular, I've focused on documenting the
invariants around when the inner `UnsafeCell` that stores the value can
be accessed by the sender and receiver sides of the channel.

I still want to take a closer look at when the waker cells can be set,
and I'd like to add more documentation there in a follow-up branch.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Nov 15, 2021
# 1.13.1 (November 15, 2021)

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])

[#4226]: #4226
hawkw added a commit that referenced this pull request Nov 15, 2021
# 1.13.1 (November 15, 2021)

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])

[#4226]: #4226
hawkw added a commit that referenced this pull request Nov 15, 2021
Depends on #4226

## Motivation

Currently, the safety invariants and synchronization strategy used in
`tokio::sync::oneshot` are not particularly obvious, especially to a new
reader. It would be nice to better document this code to make these
invariants clearer.

## Solution

This branch adds `SAFETY:` comments to the `oneshot` channel
implementation. In particular, I've focused on documenting the
invariants around when the inner `UnsafeCell` that stores the value can
be accessed by the sender and receiver sides of the channel.

I still want to take a closer look at when the waker cells can be set,
and I'd like to add more documentation there in a follow-up branch.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Nov 15, 2021
# 1.13.1 (November 15, 2021)

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])

[#4226]: #4226
hawkw added a commit that referenced this pull request Nov 16, 2021
Depends on #4226

## Motivation

Currently, the safety invariants and synchronization strategy used in
`tokio::sync::oneshot` are not particularly obvious, especially to a new
reader. It would be nice to better document this code to make these
invariants clearer.

## Solution

This branch adds `SAFETY:` comments to the `oneshot` channel
implementation. In particular, I've focused on documenting the
invariants around when the inner `UnsafeCell` that stores the value can
be accessed by the sender and receiver sides of the channel.

I still want to take a closer look at when the waker cells can be set,
and I'd like to add more documentation there in a follow-up branch.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.8.4 (November 15, 2021)

This release backports a bug fix from 1.13.1.

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])

[#4226]: #4226
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.8.4 (November 15, 2021)

This release backports a bug fix from 1.13.1.

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])

[#4226]: #4226
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.14.0 (November 15, 2021)

### Fixed

- macros: fix compiler errors when using `mut` patterns in `select!`
  ([#4211])
- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])
- sync: make `AtomicWaker` panic safe ([#3689])
- runtime: fix basic scheduler dropping tasks outside a runtime context
  ([#4213])

### Added

- stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223])

### Changed

- io: updated `copy` buffer size to match `std::io::copy` ([#4209])

### Documented

- io: rename buffer to file in doc-test ([#4230])
- sync: fix Notify example ([#4212])

[#4211]: #4211
[#4226]: #4226
[#3689]: #3689
[#4213]: #4213
[#4179]: #4179
[#4223]: #4223
[#4209]: #4209
[#4230]: #4230
[#4212]: #4212
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.14.0 (November 15, 2021)

### Fixed

- macros: fix compiler errors when using `mut` patterns in `select!`
  ([#4211])
- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])
- sync: make `AtomicWaker` panic safe ([#3689])
- runtime: fix basic scheduler dropping tasks outside a runtime context
  ([#4213])

### Added

- stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223])

### Changed

- io: updated `copy` buffer size to match `std::io::copy` ([#4209])

### Documented

- io: rename buffer to file in doc-test ([#4230])
- sync: fix Notify example ([#4212])

[#4211]: #4211
[#4226]: #4226
[#3689]: #3689
[#4213]: #4213
[#4179]: #4179
[#4223]: #4223
[#4209]: #4209
[#4230]: #4230
[#4212]: #4212
hawkw added a commit that referenced this pull request Nov 16, 2021
Depends on #4226

## Motivation

Currently, the safety invariants and synchronization strategy used in
`tokio::sync::oneshot` are not particularly obvious, especially to a new
reader. It would be nice to better document this code to make these
invariants clearer.

## Solution

This branch adds `SAFETY:` comments to the `oneshot` channel
implementation. In particular, I've focused on documenting the
invariants around when the inner `UnsafeCell` that stores the value can
be accessed by the sender and receiver sides of the channel.

I still want to take a closer look at when the waker cells can be set,
and I'd like to add more documentation there in a follow-up branch.

Signed-off-by: Eliza Weisman <eliza@buoyant.io>
hawkw added a commit that referenced this pull request Nov 16, 2021
# 1.8.4 (November 15, 2021)

This release backports a bug fix from 1.13.1.

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed ([#4226])

[#4226]: #4226
fengalin pushed a commit to fengalin/tokio that referenced this pull request Nov 17, 2021
fengalin pushed a commit to fengalin/tokio that referenced this pull request Nov 17, 2021
gstreamer-github pushed a commit to sdroege/gst-plugin-rs that referenced this pull request Nov 17, 2021
A data race condition was discovered in tokio, which can lead
to memory corruption. This vulnerability affects our fork.

See:

- https://rustsec.org/advisories/RUSTSEC-2021-0124
- tokio-rs/tokio#4225
- tokio-rs/tokio#4226
- fengalin/tokio#1

Fixes: https://gitlab.freedesktop.org/gstreamer/gst-plugins-rs/-/issues/174
}
// TODO: This could be `Release`, followed by an `Acquire` fence *if*
// the `RX_TASK_SET` flag is set. However, `loom` does not support
// fences yet.
Copy link
Contributor

@cynecx cynecx Nov 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think loom has partial support for fences iiuc (tokio-rs/loom#220), so technically this comment isn't correct anymore :D.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think there might be a couple other places in Tokio where we also can implement some things nicer now that loom supports fences. I didn't want to actually change that aspect of the implementation in the bugfix branch, though. Might be worth doing in a subsequent one!

kawadakk added a commit to solid-rs/tokio that referenced this pull request Nov 18, 2021
# 1.8.4 (November 15, 2021)

This release backports a bug fix from 1.13.1.

### Fixed

- sync: fix a data race between `oneshot::Sender::send` and awaiting a
  `oneshot::Receiver` when the oneshot has been closed (tokio-rs#4226)
@Arnavion
Copy link
Contributor

Arnavion commented Jan 4, 2022

Is there any plan to backport this to 0.1.x ?

@Darksonn
Copy link
Contributor

Darksonn commented Jan 4, 2022

We don't have any plans, no.

@Arnavion
Copy link
Contributor

Arnavion commented Jan 4, 2022

Okay, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Category: This is a bug. I-crash Problems and improvements related to program crashes/panics. M-sync Module: tokio/sync R-loom Run loom tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Race leads to panic in oneshot::Sender::send()
4 participants